(54) Instruction prefetch unit

(57) A prefetch buffer is described which supports a
computer system having a plurality of different instruction
modes. The number of storage locations which are read out
of the prefetch buffer during each machine cycle is
controlled in dependence on the instruction mode. Thus the
prefetch buffer allows a number of different instruction
modes to be supported and hides memory access latency.

Description

[0001] The present invention relates to a prefetch unit 
for use in a computer system. 

[0002] In a computer system, instructions are typically
fetched from a program memory, decoded and supplied
to an execution unit where they are executed to run the
program stored in the program memory. If more than one
execution unit is provided, it is possible to arrange for
very high speed instruction execution. In order to take
advantage of this, it is clearly necessary to be able to
supply decoded instructions to the execution unit at a
sufficient rate. Presently, access times to memory cannot
match execution speeds, and therefore several machine
cycles are needed to access each new instruction
from memory. Thus, there can be a severe performance
degradation because the fetches from memory cannot
match the rate at which instructions can be executed by
the execution units.

[0003] According to the present invention there is provided
a prefetch buffer for holding instructions in a processor
having a memory and an instruction decode unit,
the prefetch buffer comprising:

a plurality of storage locations, each having the 
same bit capacity (2n bits) and arranged in groups 
with the same number p of storage locations in each 
group; 

a write port for selectively writing words of bit length 
n x p from the memory into respective groups of the 
prefetch buffer; 

read circuitry for reading instructions out of the 
prefetch buffer in dependence on an instruction 
mode of the processor, said instruction mode con- 
trolling the number of storage locations which are 
read during a machine cycle; and 
means for indicating when all storage locations in a 
group have been read so that a fetch signal can be 
issued to fetch a next word from the memory into 
the storage locations of that group. 

[0004] In the described embodiment, each storage location
has a capacity of 32 bits, and the storage locations are
arranged in groups of four such that each group has a
capacity for a 128 bit word read out of memory on a memory
fetch. In the described embodiment, four groups of storage
locations are provided in the prefetch buffer, thus allowing
for up to four successive memory accesses even if the
first word has not yet been either received or executed.
Moreover, because the processor supports more than
one instruction mode, the time which it takes to read all
storage locations in a group in terms of machine cycles
can vary. According to the invention, the indicating
means allow for a next word to be fetched from memory
when all storage locations in a group have been read,
however many machine cycles that has taken. Thus,
memory latency is hidden through this mechanism.
[0005] According to a first instruction mode, one storage
location is read out during each machine cycle to
provide a pair of 16 bit instructions to the decode unit
(referred to herein as GP16 mode).
[0006] According to a second instruction mode, two
storage locations are read during each machine cycle
to provide two 32 bit instructions to the decode unit
(referred to herein as GP32 mode).
[0007] According to a third instruction mode, four storage
locations are read out during each machine cycle
to provide four instructions each of 32 bits to the decode
unit (referred to herein as VLIW (Very Long Instruction
Word) mode).

[0008] In the described embodiment, the indicating
means comprises a set of flags, each group having a
flag associated therewith which is set to indicate that all
storage locations in the associated group have been
read so as to initiate a subsequent memory fetch.
[0009] The invention also provides a prefetch unit
comprising a prefetch buffer as hereinabove defined
and control circuitry arranged to monitor the indicating
means and to issue a fetch signal to memory to fetch
the next word into the prefetch buffer when all storage
locations in a group have been read. The control circuitry
can include an aligner for controlling a read pointer
determining the storage locations to be read in a next
machine cycle.
[0010] For a better understanding of the present invention
and to show how the same may be carried into
effect, reference will now be made by way of example
to the accompanying drawings, in which:

Figure 1 is a block diagram of a prefetch unit;
Figure 2 illustrates the different instruction modes
of the processor;
Figure 3 illustrates the organisation of a prefetch
buffer; and
Figure 4 is a schematic diagram illustrating the
operation of the prefetch buffer.

[0011] Figure 1 is a block diagram of a prefetch unit 2
for a processor, the prefetch unit 2 comprising a prefetch
buffer 4 with associated control bits 6 and control circuitry
comprising a prefetcher 8 and an aligner 10. The
prefetcher 8 is connected to a program memory 12 and
is responsible for initiating memory accesses to the
program memory 12 using memory access control signals
14a, 14b. The address in memory to which a fetch is
initiated is held in a prefetch program counter 16 in the
prefetcher 8. Control of the prefetch program counter is
not discussed herein, but it can be assumed that fetches
are initiated from memory in accordance with a sequence
of instructions to be executed by the processor.
That is, the prefetch program counter may be incremented
each time as a sequence of adjacent instructions is
fetched, or it may change according to branches, traps,
interrupts etc. Responsive to a memory fetch initiated
by the prefetcher, instruction words are supplied from
the program memory 12 to the prefetch buffer 4 as
represented by data-in path 18.

[0012] The aligner 10 controls reading of instructions
from the prefetch buffer to a decoder 20 along data-out
path 22. To do this, the aligner issues and is responsive
to prefetcher align (PFAL)/decoder control signals 24a,
24b. The aligner 10 has an align program counter 26
which keeps track of how many instructions have been
dispatched to the decoder 20 in each machine cycle,
and a state machine 28 which generates a read pointer
RP for controlling the prefetch buffer in a manner which
is described in more detail hereinafter.
[0013] Instructions in the program memory 12 can
have a length of 16 bits or 32 bits. The prefetch buffer
supports three different instruction modes as described
with reference to Figure 2 as follows. The instruction
mode is held in a process status register (PSR) 3 and
can be changed. Change mode signals chmd1, chmd2
are issued by the decoder 20 responsive to a change in
instruction mode.

[0014] According to a first instruction mode, a pair of
16 bit instructions are supplied during each machine cycle
to the decoder 20 from the prefetch buffer 4. This
pair is denoted slot0, slot1 in bit sequences w0, w1 etc.
This is referred to herein as GP16 mode.
[0015] According to a second instruction mode, two
instructions each having a length of 32 bits are supplied
to the decoder from the prefetch buffer in each machine
cycle, for example w0, w1 in CYCLE 0. This mode is
referred to herein as GP32 mode.

[0016] According to a third instruction mode, four
instructions w0, w1, w2, w3 each of 32 bits in length are
supplied to the decoder in each machine cycle. This is
referred to herein as VLIW.

[0017] In all modes, each fetch operation initiated to
the program memory 12 retrieves an instruction word of
128 bits in length.

[0018] Thus, in GP16 mode, the instruction word comprises
eight 16 bit instructions, paired as slot0, slot1 for
each machine cycle. In GP32 and VLIW mode, the
instruction word comprises four 32 bit instructions.
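
The three instruction modes of paragraphs [0014] to [0018] can be tabulated as in the following sketch. It is illustrative only: the names Mode, MODES, LINE_BITS and LOCATION_BITS are hypothetical and are not taken from the described embodiment.

```python
from dataclasses import dataclass

LINE_BITS = 128      # width of one instruction word fetched from memory
LOCATION_BITS = 32   # capacity of one storage location


@dataclass(frozen=True)
class Mode:
    name: str
    instruction_bits: int      # length of a single instruction
    locations_per_cycle: int   # storage locations read per machine cycle

    @property
    def instructions_per_cycle(self) -> int:
        # Bits read per cycle divided by the length of one instruction.
        return self.locations_per_cycle * LOCATION_BITS // self.instruction_bits

    @property
    def instructions_per_line(self) -> int:
        # Number of instructions held in one 128 bit instruction word.
        return LINE_BITS // self.instruction_bits


MODES = {
    "GP16": Mode("GP16", 16, 1),
    "GP32": Mode("GP32", 32, 2),
    "VLIW": Mode("VLIW", 32, 4),
}

for mode in MODES.values():
    print(mode.name, mode.instructions_per_cycle, mode.instructions_per_line)
```

Under these assumptions the sketch reports eight instructions per 128 bit word in GP16 mode and four in GP32 and VLIW mode, in line with paragraph [0018].
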
[0019] The organisation of the prefetch buffer 4 is
illustrated in Figure 3. In diagrammatic terms, the
prefetch buffer can be considered to have four successive
lines L0 to L3 each having a capacity of 128 bits.
There is a single write port WPO having a width of 128
bits which receives data from the program memory via
the data-in path 18 and an input latch FF-in and writes
it into the selected line under the control of a write pointer
WP [3:0]. Each line comprises four storage locations
each having a capacity of 32 bits and each of which is
shown diagrammatically divided into two 16 bit sections
for the purposes of explanation. The storage locations
are denoted F0 to F15. Each line in Figure 3 is referred
to herein as a group of storage locations and has the
capacity for one 128 bit line from memory. This allows
up to four successive memory accesses to be made,
even if the first instruction word has not been received
or executed by the processor. While the instruction word
in storage locations F0 to F3 is being decoded and
subsequently executed, memory fetches can continue to be
implemented into the storage locations F4 to F7, F8 to
F11 and F12 to F15 until the buffer is full. By the time
that a memory fetch has been made into the last group
F12 to F15, it is most likely that the first group F0 to F3
will have been completely read out into the decoder and
will thus be ready to receive a subsequent instruction
word from memory. The number of cycles required to
decode an instruction word in each group varies
depending on the instruction mode of the machine in a
manner which will be described in more detail in the
following. Nevertheless, a minimum of one cycle is
required for reading and decoding, and therefore the use
of the prefetch buffer hides memory latency.
[0020] In order to save a cycle when the prefetch buffer
is empty or flushed after a branch, data can bypass
the prefetch buffer through bypass circuitry BS. As
described in more detail later, the bypass circuitry is
implemented as a plurality of multiplexors (MUX0 to MUX3
in Figure 4).

[0021] Figure 4 is a more detailed diagram of the 
prefetch buffer and its associated read circuitry. The 
storage locations F0 to F15 are illustrated aligned ver- 
tically for the purposes of explanation. 
[0022] The control bits 6 described above in Figure 1
include empty flags EF1 to EF4 which indicate when a
complete 128 bit line of storage locations is empty such
that a subsequent memory fetch can be initiated. When
a fetch is instituted from memory, and data has been
received by the prefetch buffer, the empty flag is cleared
to indicate that those storage locations are now full.
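
The interplay of the four lines, the write pointer and the empty flags can be sketched as a small software model. This is not the described circuit: the class PrefetchBufferModel and its method names are hypothetical, and the sketch assumes the convention of this paragraph, namely that a flag is set while its line is empty and cleared once data has been received.

```python
class PrefetchBufferModel:
    """Toy model: four lines (groups) of four 32 bit storage locations."""

    LINES = 4
    LOCATIONS_PER_LINE = 4

    def __init__(self):
        # F0..F15, flattened; None means nothing has been written yet.
        self.locations = [None] * (self.LINES * self.LOCATIONS_PER_LINE)
        # One flag per line; True means "empty, a fetch may be initiated".
        # (Assumed polarity: set when empty, cleared when full.)
        self.empty = [True] * self.LINES
        self.write_line = 0  # stands in for the write pointer WP

    def can_fetch(self) -> bool:
        """True if the line selected by the write pointer is empty."""
        return self.empty[self.write_line]

    def write(self, words):
        """Write one 128 bit line, given as four 32 bit values."""
        if len(words) != self.LOCATIONS_PER_LINE:
            raise ValueError("a line holds exactly four storage locations")
        if not self.can_fetch():
            raise RuntimeError("target line has not been read out yet")
        base = self.write_line * self.LOCATIONS_PER_LINE
        self.locations[base:base + self.LOCATIONS_PER_LINE] = list(words)
        self.empty[self.write_line] = False  # line is now full
        self.write_line = (self.write_line + 1) % self.LINES

    def mark_read(self, line: int):
        """Called once every storage location of `line` has been read."""
        self.empty[line] = True


buf = PrefetchBufferModel()
for n in range(4):
    buf.write([4 * n + k for k in range(4)])
print(buf.can_fetch())   # False: all four lines are full
buf.mark_read(0)
print(buf.can_fetch())   # True: line L0 may be refilled
```
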
[0023] Reading from the prefetch buffer will now be
described with reference to the schematic diagram of
Figure 4. The prefetch buffer includes four read ports
RP1, RP2, RP3 and RP4. These read ports each take the
form of multiplexors each capable of connecting selected
ones of the storage locations F0 to F15 to a 32 bit
output, pf-buf-out1, 2, 3 or 4. However, the read ports are
not identical. The first read port RP1 has sixteen inputs
each of which is connected to a respective storage
location F0 to F15 and each of which can be connected
to the output pf-buf-out1. The second read port RP2 has
eight inputs which are respectively connected to storage
locations F1, F3, F5, F7, F9, F11, F13, F15 to selectively
connect the contents of those storage locations to the
output pf-buf-out2.

[0024] The third read port RP3 has four inputs connected
to storage locations F2, F6, F10 and F14 for
selectively connecting the contents of those storage
locations to the output pf-buf-out3. The fourth read port RP4
also has four inputs which are connected to storage
locations F3, F7, F11 and F15 for selectively connecting
the contents of those storage locations to the output
pf-buf-out4.
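
The wiring of the four read ports can be written out as a lookup table, which makes it easy to check which consecutive reads the ports can serve. The sketch below is illustrative; READ_PORT_INPUTS and assign_ports are hypothetical names, and the assumption that consecutive locations are mapped onto RP1, RP2, RP3, RP4 in that order is inferred from the mode descriptions rather than stated as a rule.

```python
# Storage location indices (0 means F0, 15 means F15) wired to each read port.
READ_PORT_INPUTS = {
    "RP1": set(range(16)),        # F0 to F15
    "RP2": set(range(1, 16, 2)),  # F1, F3, ..., F15
    "RP3": {2, 6, 10, 14},
    "RP4": {3, 7, 11, 15},
}


def assign_ports(first_location: int, count: int) -> dict:
    """Map `count` consecutive locations onto RP1, RP2, ... in order.

    Raises ValueError if a port is not wired to the location it would
    have to read.
    """
    ports = ["RP1", "RP2", "RP3", "RP4"][:count]
    assignment = {}
    for offset, port in enumerate(ports):
        location = (first_location + offset) % 16
        if location not in READ_PORT_INPUTS[port]:
            raise ValueError(f"{port} has no input from F{location}")
        assignment[port] = location
    return assignment


print(assign_ports(0, 1))   # one location per cycle, starting at F0
print(assign_ports(2, 2))   # two locations per cycle: F2 and F3
print(assign_ports(4, 4))   # four locations per cycle: F4 to F7
```

With this wiring, a two-location read succeeds only when it starts at an even location and a four-location read only when it starts at a multiple of four; for example, assign_ports(1, 2) raises ValueError because RP2 has no input from F2.
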

[0025] The read ports RP1 to RP4 are controlled by
the read pointer RP from the aligner 10 in dependence
on the instruction mode of the machine and the
consequential number of machine cycles required for
decoding each instruction word.

[0026] Alternatively, for instructions supplied directly
from memory along data-in path 18, the control of
instructions supplied to the decoder in dependence on the
instruction mode and machine cycles is additionally
controllable by multiplexors MUX0, MUX1, MUX2 and
MUX3. These receive at their input respective bits of the
128 bit data-in path 18 to supply a 32 bit sequence to
each multiplexor in each machine cycle as described in
the following.

[0027] The selection of which instructions within the
instruction word are supplied to the decoder 20 is made
in dependence on the instruction mode as described in
the following. In Figure 3, the symbols w0 to w3 are used
on different input lines of the multiplexors MUX0 to
MUX3 to represent different 32 bit sequences, as in
Figure 4. The definition of each 32 bit sequence depends
on the instruction mode, but bits of the data-in path are
always allocated as w0 [0:31], w1 [32:63], w2 [64:95],
w3 [96:127]. The inputs to the multiplexors are individually
labelled so as to distinguish between them. That
is, in GP16 mode, on the first decode cycle, cycle 0, the
first sequence w0 is supplied to the decoder 20. This
presents a pair of 16 bit instructions, slot0, slot1 (w0) for
simultaneous decoding by the decoder 20. On the next
cycle, cycle 1, the sequence w1 is supplied, presenting
the next pair of 16 bit instructions slot0, slot1 (w1) for
decoding. In GP16 mode, the read port RP1 and the
multiplexor MUX0 are the only read devices which are
used and the control of the word which is supplied to the
decoder is made by the multiplexor MUX0 under the
control of signal mux-ctrl0, and the read pointer RP. If
the signal mux-ctrl0 selects the read port output
pf-buf-out1, the read pointer selects inputs F0 to F3 over four
successive cycles CYCLE 0 to CYCLE 3 to read out
successively w0 to w3. Once storage location F3 has been
read out, the read port counter will reset the read port
RP1 so that it reads out from storage locations F4 to F7
over the next four cycles. If the buffer is not in use, the
first instruction pair w0 is read out by the multiplexor
MUX0. That is, in cycle 0, input M00 of the multiplexor
MUX0 is selected. Meanwhile, the 128 bit line is loaded
into the first location of the prefetch buffer and the read
pointer points to the next location to be read out by the
decoder. Therefore on cycle 1, the next instruction pair
w1 is read out by the multiplexor MUX0 by selecting
pf-buf-out1.
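
The allocation of the 128 bit data-in path to the sequences w0 to w3, and of a 32 bit sequence to a pair of 16 bit instructions, can be sketched as bit slicing. The functions split_line and split_pair are hypothetical. Two assumptions are made that the text does not settle: bit 0 is treated as the least significant bit of an integer, and slot0 is taken to occupy the lower 16 bits of its sequence.

```python
def split_line(line: int) -> list:
    """Split a 128 bit value into w0..w3, with w0 taken from bits 0 to 31.

    Assumption: bit 0 is the least significant bit of `line`.
    """
    if not 0 <= line < 1 << 128:
        raise ValueError("line must fit in 128 bits")
    return [(line >> (32 * i)) & 0xFFFFFFFF for i in range(4)]


def split_pair(word: int) -> tuple:
    """Split a 32 bit sequence into two 16 bit instructions (slot0, slot1).

    Assumption: slot0 occupies the lower 16 bits.
    """
    return word & 0xFFFF, (word >> 16) & 0xFFFF


line = 0x33333333_22222222_11111111_AAAA0000
w0, w1, w2, w3 = split_line(line)
print(hex(w1), [hex(s) for s in split_pair(w0)])
```
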

[0028] In GP32 mode, in the first machine cycle the
first two instructions w0, w1 are presented to the decoder
20. In the subsequent cycle, cycle 1, the next two
instructions w2, w3 are presented to the decoder. This
utilises read ports RP1 and RP2 and the multiplexors
MUX0 and MUX1. If the signal mux-ctrl0 is set to
pf-buf-out1, and mux-ctrl1 to pf-buf-out2, then the read
pointer RP is set to F0 for RP1 and F1 for RP2 in cycle
0. In cycle 1, it is changed to F2 and F3 respectively.
Instructions are then read over the next two cycles from
the next group of storage locations F4 to F7 by altering
the setting of the read ports RP1 and RP2 responsive
to the read pointer RP. Alternatively, when read from the
data-in path 18, in the first cycle, the first input M10 of
the multiplexor MUX1 is set to read w1 (bits 32 to 63)
and the first input M00 of the multiplexor MUX0 is set to
read w0 (bits 0 to 31). Thus, instructions w0 and w1 are
presented to the decoder 20 in CYCLE 0. Meanwhile,
the 128 bit line is loaded into the prefetch buffer so that
in the subsequent cycle, CYCLE 1, w2 and w3 are read
from the buffer by selecting pf-buf-out1 and pf-buf-out2.
[0029] In VLIW mode, four 32 bit instructions W0 to
W3 (slot0 to slot3) are supplied simultaneously to the
decoder 20 in each machine cycle, e.g. CYCLE 0. The
multiplexors MUX2 and MUX3 are set according to the
control signals mux-ctrl2 and mux-ctrl3 respectively to
allow the instruction words w2 and w3 to be read either
from the buffer or from the data-in path 18. In other
respects, the settings of RP1 and RP2, MUX0 and MUX1
are as in GP32 mode. However, in the subsequent
cycle, e.g. CYCLE 1 in VLIW mode, it will be noticed that
the instruction words w2 and w3 which would have been
remaining in GP32 mode have now been read out.
Therefore, the read pointer RP can immediately move
on to the next set of storage locations F4 to F7 to read
out the subsequent VLIW instruction word containing
the next four instructions.
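
The way the read pointer advances through the storage locations in each mode can be sketched as a generator. The names read_schedule and LOCATIONS_PER_CYCLE are hypothetical, and the wrap-around from F15 back to F0 is an assumption based on the four-line organisation of the buffer; stalls, branches and mode changes are ignored.

```python
from itertools import islice

# Storage locations read per machine cycle in each instruction mode.
LOCATIONS_PER_CYCLE = {"GP16": 1, "GP32": 2, "VLIW": 4}


def read_schedule(mode: str, start: int = 0):
    """Yield, cycle by cycle, the storage locations read in `mode`."""
    step = LOCATIONS_PER_CYCLE[mode]
    pointer = start
    while True:
        yield [f"F{(pointer + k) % 16}" for k in range(step)]
        pointer = (pointer + step) % 16


for mode in LOCATIONS_PER_CYCLE:
    print(mode, list(islice(read_schedule(mode), 3)))
```

For the first three cycles this gives F0, F1, F2 in GP16 mode, the pairs (F0, F1), (F2, F3), (F4, F5) in GP32 mode, and the groups F0 to F3, F4 to F7, F8 to F11 in VLIW mode.
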

[0030] Data is passed from the multiplexors MUX0 to
MUX3 to respective output flip-flops FF0 to FF3 via a
set of control gates labelled GC1, GC2 and GS0 to GS3.
The control gates GC1, GC2 are responsive to change
mode signals chmd1, chmd2 respectively which indicate
to the prefetch unit that there has been a change in the
instruction mode in which the machine is operating. The
control gates GS0 to GS3 are responsive to respective
stop signals stop[0] to stop[3] to prevent any new data
from entering the decoder from that output flip-flop.
These effectively allow the decoder to be stalled. In a
stop condition, the output of each flip-flop is recirculated
to the input of its associated control switch to prevent
unnecessary operation of the subsequent decoder.
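
The effect of the stop signals on the output flip-flops can be sketched as a per-slot choice between holding the old value and taking the new one. The function clock_output_flops is a hypothetical behavioural model, not a description of the gates GS0 to GS3 themselves.

```python
def clock_output_flops(flops, mux_outputs, stop):
    """Return the next state of FF0..FF3 after one clock edge.

    flops       -- current contents of the four output flip-flops
    mux_outputs -- values presented by MUX0..MUX3 this cycle
    stop        -- four booleans standing in for stop[0]..stop[3]
    """
    return [
        held if stopped else new   # a stopped flip-flop recirculates its output
        for held, new, stopped in zip(flops, mux_outputs, stop)
    ]


state = [0xA0, 0xA1, 0xA2, 0xA3]
state = clock_output_flops(state, [0xB0, 0xB1, 0xB2, 0xB3],
                           [False, True, False, True])
print([hex(v) for v in state])   # FF1 and FF3 keep their previous contents
```
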
[0031] Operation of the prefetch unit responsive to the
change mode signals chmd1 and chmd2 will now be
described. The output flip-flop FF0 is connected to a single
32 bit decoder and to two 16 bit decoders. When the
machine is in GP16 mode, the outputs of the two 16 bit
decoders are selected for the instruction pair supplied
to the flip-flop FF0. When the machine is in GP32 mode,
the output of the 32 bit decoder is selected. The remaining
flip-flops FF1 to FF3 are each connected to respective
32 bit decoders.

[0032] A first change mode signal chmd1 signals a
change of machine instruction mode from GP32 to
GP16. If the machine had been operating in GP32
mode, consider the situation at the end of cycle 0 with
reference to Figure 2. Instructions w0 and w1 will have
been supplied via the flip-flops FF0 and FF1 to the
respective 32 bit decoders of the decoder 20. However,
the change in instruction mode now implies that the 32
bit sequence which was formerly to be considered as
the second instruction W1 in cycle 0 of GP32 mode, in
fact contains a pair of 16 bit instructions as denoted in
cycle 1 of GP16 mode. Thus, the output of the 32 bit
decoder connected to the flip-flop FF1 needs to be
ignored, and the 32 bit sequence w1 needs to be reapplied
to the two 16 bit decoders connected to the output
flip-flop FF0. This is achieved by the recirculation line 42 from
the output of the flip-flop FF1 to the input of the control
gate GC1.

[0033] Conversely, control signal chmd2 denotes a
change of instruction mode from GP16 to GP32. Consider
again the effect at the end of cycle 0 with reference
to Figure 2. The instruction pair denoted w0 has just
been decoded in GP16 mode, and the expectation is
that the machine will now wait for the next instruction
pair w1. However, in GP32 mode, that word w1 represents
a single instruction and the change mode signal
chmd2 allows it to be applied directly through the control
gate GC2 to the output flip-flop FF1 so that it can be
applied directly to the input of the dedicated 32 bit
decoder connected to the output of the flip-flop FF1. This
allows the instruction w1 to be decoded as a single 32
bit instruction. In the next cycle, instructions w2 and w3
can be transmitted normally as indicated by cycle 1 in
GP32 mode in Figure 2.
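
The routing performed on a mode change, as described in paragraphs [0032] and [0033], can be summarised in a small behavioural sketch. The function route_after_mode_change and its dictionary result are hypothetical; the sketch only records which 32 bit sequence is steered to which output flip-flop and decoder at the end of cycle 0, and it does not model the gates GC1 and GC2 or any timing.

```python
def route_after_mode_change(signal, ff1_word, next_word):
    """Describe where the affected 32 bit sequence is decoded next.

    signal    -- "chmd1" (GP32 to GP16) or "chmd2" (GP16 to GP32)
    ff1_word  -- sequence currently held in output flip-flop FF1
    next_word -- sequence that would be read next from the buffer
    """
    if signal == "chmd1":
        # GP32 to GP16: the sequence in FF1 is recirculated to FF0 and
        # decoded again, this time as a pair of 16 bit instructions.
        return {"flip_flop": "FF0", "word": ff1_word,
                "decoder": "two 16 bit decoders"}
    if signal == "chmd2":
        # GP16 to GP32: the next sequence is applied directly to FF1 and
        # decoded as a single 32 bit instruction.
        return {"flip_flop": "FF1", "word": next_word,
                "decoder": "32 bit decoder"}
    raise ValueError("signal must be 'chmd1' or 'chmd2'")


print(route_after_mode_change("chmd1", ff1_word="w1", next_word="w2"))
print(route_after_mode_change("chmd2", ff1_word=None, next_word="w1"))
```

In both directions it is the sequence w1 that is affected: after chmd1 it is re-presented at FF0, and after chmd2 it is presented at FF1.
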

[0034] It will be clear from the above that the number
of cycles needed to read all four storage locations in a
group depends on the instruction mode. That is, in GP16
mode, four cycles are needed, in GP32 two cycles are
needed and in VLIW one cycle is needed. When all the
storage locations F0 to F3 in the first group have been
read, the first empty flag EF1 is cleared to empty.
[0035] The aligner controls the setting and clearing of
the "empty" flags using information from the read pointer.
The aligner detects when the read pointer goes from
one line (128 bits) to the next. When this occurs, the
"empty" flag corresponding to the page which has just
been read is set.
[0036] The state of an empty flag being cleared is
detected by the prefetcher 8 along line 48 and a fetch is
initiated to the next prefetch address in the prefetch
program counter 16. Thus, the next instruction line is
fetched from memory and the write pointer WP is set to
write it into storage locations F0 to F3. In the meantime,
the read pointer has moved to the second group F4 to
F7 to read and decode instructions of that group. When
those storage locations are empty, the empty flag EF2
is cleared, a next memory fetch is initiated by the
prefetcher 8 and the read pointer moves onto the group
F8 to F11. As can readily be seen, the prefetch buffer
masks a latency of memory fetches of at least three
cycles in the VLIW mode, and a greater number of cycles
in GP32 and GP16 mode. Signals are supplied from the
decoder along line 24b to the aligner 10 indicating what
mode the decoder is operating in so that the aligner can
adjust the align program counter 26 accordingly and
keep track of the next instructions to be decoded so that
the read pointer RP can correctly be issued by the state
machine 28.
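
The latency figures quoted above can be reproduced with a short calculation. The reasoning is an interpretation rather than a statement from the text: once a line has been read and its refill requested, the read pointer still has the other three lines to work through before it returns, so the fetch has that many cycles in which to complete. The names cycles_per_line and maskable_latency are hypothetical.

```python
LINES = 4                # groups of storage locations in the buffer
LOCATIONS_PER_LINE = 4   # storage locations per group
LOCATIONS_PER_CYCLE = {"GP16": 1, "GP32": 2, "VLIW": 4}


def cycles_per_line(mode: str) -> int:
    """Machine cycles needed to read all storage locations of one group."""
    return LOCATIONS_PER_LINE // LOCATIONS_PER_CYCLE[mode]


def maskable_latency(mode: str) -> int:
    """Cycles spent reading the other three lines before a refilled
    line is needed again (assumes no stalls, branches or mode changes)."""
    return (LINES - 1) * cycles_per_line(mode)


for mode in ("GP16", "GP32", "VLIW"):
    print(mode, cycles_per_line(mode), maskable_latency(mode))
```

Under these assumptions the sketch gives three cycles for VLIW mode, six for GP32 mode and twelve for GP16 mode, consistent with the statement that VLIW mode masks at least three cycles and the other modes a greater number.
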



Claims 

1. A prefetch buffer for holding instructions in a
processor having a memory and an instruction decode
unit, the prefetch buffer comprising:

a plurality of storage locations, each having the 
same bit capacity (2n bits) and arranged in 
groups with the same number p of storage lo- 
cations in each group; 

a write port for selectively writing words of bit 
length n x p from the memory into respective 
groups of the prefetch buffer; 
read circuitry for reading instructions out of the 
prefetch buffer in dependence on an instruction 
mode of the processor, said instruction mode 
controlling the number of storage locations 
which are read during a machine cycle; and 
means for indicating when all storage locations 
in a group have been read so that a fetch signal 
can be issued to fetch a next word from the 
memory into the storage locations of that group. 

2. A prefetch buffer according to claim 1, wherein the
read circuitry is responsive to a first instruction
mode to read out one storage location during each
machine cycle to provide two instructions each of
bit length n to the decode unit (GP16).

3. A prefetch buffer according to claim 1 or 2, wherein 
the read circuitry is responsive to a second mode 
of operation to read out two storage locations during 
each machine cycle to provide two instructions 
each of bit length 2n to the decode unit (GP32). 

4. A prefetch buffer according to claim 1 or 2, wherein 
the read circuitry is responsive to a third mode of 
operation to read out four storage locations during 
each machine cycle to provide four instructions 
each of bit length 2n to the decode unit (VLIW). 

5. A prefetch buffer according to any preceding claim,
wherein said indicating means comprises a set of
flags, each group having a flag associated therewith
which is set to indicate that all storage locations in
the associated group have been read.

6. A prefetch buffer according to any preceding claim,
wherein the number of storage locations in the
prefetch buffer matches the latency of a memory
fetch operation.

7. A prefetch unit comprising:
a prefetch buffer according to any preceding
claim; and

control circuitry arranged to monitor said indicating
means and to issue a fetch signal to
memory to fetch said next word into the
prefetch buffer when all storage locations in a
group have been read.

8. A prefetch unit according to claim 7, wherein the
control circuitry includes an aligner for controlling a
read pointer determining the storage location to be
read in the present machine cycle.